This report was prepared for ETC5521 Assignment 1 by Team cassowary, comprising Weihao Li and Dang Thanh Nguyen.

1 Introduction and motivation

The General Data Protection Regulation (GDPR) is a regulation adopted in EU law on 14 April 2016, which aims to “protect fundamental rights and freedoms of natural persons and in particular their right to the protection of personal data” (Regulation, 2016). Tikkinen-Piri, Rohunen, & Markkula (2018) stated that the regulation would be a challenge for companies that lack awareness of its practical implications. In other words, inconsistent practical and technical implementation of the GDPR across companies could be anticipated. By August 2020, more than 49 companies had been fined at least 100 thousand euros each, and most of these fines were issued to IT companies (CoreView, 2020). The largest and most famous GDPR fine was the 50 million euro fine issued to Google.

Meanwhile, Lewis, Conleya, & Lee-Makiyama (2020) suggested that the EU was in a weak position in the battle over its digital future. In addition, part of the motivation behind the GDPR and the later digital services taxes imposed by major EU countries may indicate that the EU was asserting its digital sovereignty. On this view, the EU intentionally targeted IT corporations held outside the EU in order to counter digital hegemony.

The GDPR fines data available from Privacy Affairs (2020) give us an opportunity to gain insight into privacy issues and possible abuse of GDPR articles in the EU. This dataset was used by Tidy Tuesday (2020), a weekly social data project in R. We describe the workflow of data collection, data cleaning and data analysis in the following sections. One limitation of this research is that a large proportion of controller names are missing from the dataset, which makes it harder to draw accurate conclusions about company nationality. Another limitation is that we did little research on the political and legal context of the EU, which reduces the accuracy of our conclusions.

2 Data description

The data were scraped from Privacy Affairs (2020) by Ellis Hughes (2020) and posted on Tidy Tuesday (2020). We adapted their scripts and removed duplicate fines that were reported by multiple sources. Our data cover the period from 2018-05-12 to 2020-08-21.

Abnormal values are set to NA. These include fines with a price of 0, which indicate ongoing trials, and fines dated before the establishment of the GDPR, which are clearly missing values. Records with a missing date are also set to NA. We further corrected inconsistencies in corporation names. Some of the fined organizations are unnamed; in this report, all of them are labelled “Unknown”.
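The cleaning rules above can be sketched in a few lines. The original analysis was done in R; the version below uses Python's standard library purely for illustration. The function name `clean_record`, the dictionary layout, the example records and the cutoff date (the adoption date given in the introduction) are all assumptions, not the scripts actually used.

```python
from datetime import date

# Assumption: the adoption date given in the introduction is used as the cutoff.
GDPR_ADOPTED = date(2016, 4, 14)


def clean_record(record, cutoff=GDPR_ADOPTED):
    """Return a cleaned copy of one fine record (a plain dict)."""
    cleaned = dict(record)

    # A price of 0 marks an ongoing trial, so it is treated as missing.
    if cleaned.get("price") == 0:
        cleaned["price"] = None

    # A date before the cutoff cannot be a real GDPR fine date.
    fine_date = cleaned.get("date")
    if fine_date is not None and fine_date < cutoff:
        cleaned["date"] = None

    # Unnamed controllers share a single label.
    name = (cleaned.get("controller") or "").strip()
    cleaned["controller"] = name or "Unknown"
    return cleaned


# Hypothetical records, not rows from the dataset.
records = [
    {"price": 0, "date": date(1970, 1, 1), "controller": ""},
    {"price": 5000, "date": date(2019, 10, 1), "controller": "Example Ltd"},
]
cleaned_records = [clean_record(r) for r in records]
```

Missing values are represented by `None` here, which plays the role of NA in R.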

The dataset has some limitations:
- Not all cases are present in the dataset. The data only include finalised cases and cases that are made public. This can lead to bias in further analysis.
- The data controller is, in many cases, vaguely described or not disclosed. This can lead to misunderstanding and bias in further analysis.

The data consist of two smaller datasets:

  1. gdpr_violations.tsv contains information on 347 fines and penalties that data protection authorities in the EU have imposed under the EU General Data Protection Regulation (GDPR). Its variables include the country and authority, the date of the violation, the fine in euros, the controller that violated the rule, … Detailed information on each variable is presented below.

  2. gdpr_text.tsv contains the 99 articles of the GDPR legal document.

2.1 Data Dictionary

2.1.1 gdpr_violations.tsv

| variable | class | description |
|---|---|---|
| id | integer | Identifier for fine/violation |
| picture | character | SVG image of violation country flag |
| name | character | Name of country where violation was enforced |
| price | integer | Fine price in Euros (€) |
| authority | character | Authority that enacted the violation |
| date | character | Date of violation |
| controller | character | Controller of data - the violator |
| article_violated | character | Specific GDPR Article violated (see the gdpr_text.tsv data for specifics) |
| type | character | Type of violation |
| source | character | Original source (URL) of fine data |
| summary | character | Summary of violation |

2.1.2 gdpr_text.tsv

| variable | class | description |
|---|---|---|
| chapter | double | GDPR Chapter Number |
| chapter_title | character | Chapter title |
| article | double | GDPR Article number |
| article_title | character | Article title |
| sub_article | double | Sub article number |
| gdpr_text | character | Raw text of article/subarticle |
| href | character | URL to the raw text itself |

2.2 Research questions

The overall aim of this research is to reveal privacy issues in cyberspace within the EU and to examine the application of the General Data Protection Regulation in practice.

Three secondary questions are addressed in this research:

  1. What were the characteristics of GDPR fines across countries, time and corporations? Were certain corporations targeted by GDPR authorities?
  2. Which articles and regulations were more likely to be violated? How much were violators commonly fined?
  3. What were the key words in each article, especially the articles referenced most often in GDPR fines? How did they reflect the privacy issues in the EU?

3 Analysis and findings

3.1 Characteristics of GDPR fines across countries, time and corporations

In this part, the researchers want to find out: What were the characteristics of GDPR fines across countries, time and corporations? Were certain corporations targeted by GDPR authorities?

Figure 3.1 shows the price of GDPR fines from 2018 to 2020. We can see a cluster of high prices (dots in red) between September 2019 and April 2020. These outliers come from Poland, Austria, Germany, Italy, Sweden and the Netherlands. Looking at the spatial pattern of these outliers in Figure 3.2, we can see that these countries are all connected to each other. We suspect there are political reasons behind this phenomenon, which is worth future study. We plot the price on a log 10 scale because its distribution is extremely skewed; the log scale gives a clearer representation of the data.
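A small numerical sketch shows why a log 10 scale helps with skewed prices. The amounts below are illustrative values, not records from the dataset, and the helper name `log10_prices` is hypothetical. The original plots were produced in R; Python is used here only for illustration.

```python
import math
import statistics


def log10_prices(prices):
    """Return log10 of each positive price, skipping missing or zero values."""
    return [math.log10(p) for p in prices if p is not None and p > 0]


# Illustrative fine amounts in euros (assumed values, not from the dataset).
fines = [1_000, 5_000, 20_000, 100_000, 50_000_000]

raw_mean = statistics.mean(fines)
raw_median = statistics.median(fines)
logged = log10_prices(fines)

print("mean:", raw_mean, "median:", raw_median)
print("log10 values:", [round(v, 2) for v in logged])
```

On the raw scale a single very large fine pulls the mean far above the median and would squash every other point against the axis. After the transform, each step of 1 corresponds to a tenfold increase, so small and large fines can be compared on the same plot.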

Figure 3.1: Boxplot of GDPR fines across time, 2018-2020

Figure 3.2: Spatial location of outliers, September 2019 to March 2020

Figure 3.3 shows the number of GDPR violations that were fined from 2018 to 2020. Periods with a high number of fines are coloured red to highlight a possible trend. The last quarter of 2019 and the first quarter of 2020 saw the highest number of GDPR violations. This is consistent with the previous box plot and again points to possible political motives.

Figure 3.3: Number of GDPR violations by year, 2018-2020

From Figure 3.4, we can clearly observe a difference between developed and developing countries. Generally, countries with high GDP are more likely to issue large GDPR fines. We can also see that Germany, the leader of the EU, has a notably wide range of fine prices, which could be a sign of fair GDPR enforcement.

Figure 3.4: Boxplot of GDPR fines by country

Figure 3.5 shows the number of fined GDPR violations by country from 2018 to 2020. Every country in the EU has fewer than 35 cases except Spain, which has nearly 100, about triple the number for Hungary, the second highest. This is an interesting pattern that further research should explore.

Figure 3.5: Number of GDPR violations by country, 2018-2020

Table 3.1: Largest cumulative GDPR fines by corporation

| Controller | Cumulative price (€) | Count | Countries | Date |
|---|---|---|---|---|
| Google | 57000000 | 2 | France, Sweden | 2019-1, 2020-3 |
| TIM - Telecom Provider | 27800000 | 1 | Italy | 2020-2 |
| Austrian Post | 18000000 | 1 | Austria | 2019-10 |
| Wind Tre S.p.A. | 16700000 | 1 | Italy | 2020-7 |
| Deutsche Wohnen SE | 14500000 | 2 | Germany | 2019-10 |
| Eni Gas e Luce | 11500000 | 2 | Italy | 2020-1 |
| 1&1 Telecom GmbH | 9550000 | 1 | Germany | 2019-12 |
| National Revenue Agency | 2628100 | 2 | Bulgaria | 2019-8, 2019-9 |
| Allgemeine Ortskrankenkasse | 1240000 | 1 | Germany | 2020-6 |
| Vodafone España | 918000 | 19 | Spain | 2019-3, 2019-4, 2019-6, 2019-10, 2019-11, 2020-1, 2020-2, 2020-3 |

Note: Companies with background in red are owned by corporations or individuals outside EU

Table 3.2: GDPR fines issued to Vodafone

| Controller | Countries | Cumulative price (€) | Count | Date |
|---|---|---|---|---|
| Vodafone España | Spain | 918000 | 19 | 2019-10, 2019-4, 2019-6, 2019-3, 2019-11, 2020-1, 2020-2, 2020-3 |
| Vodafone ONO | Spain | 96000 | 2 | 2019-6, 2020-2 |
| Vodafone Romania | Romania | 7150 | 2 | 2020-2, 2020-3 |

Table 3.1 lists the ten corporations that have accumulated the largest fines since the introduction of the GDPR. Only three of them are non-local corporations, which suggests that we cannot reject the hypothesis that multinational corporations are treated fairly under the GDPR.
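A ranking like the one in Table 3.1 can be obtained by summing fines per controller and sorting the totals. The sketch below is in Python rather than the R used for the report, and the function name `top_controllers` and the sample rows are hypothetical; it only illustrates the aggregation, not the exact code behind the table.

```python
from collections import defaultdict


def top_controllers(fines, n=10):
    """Rank controllers by cumulative fine.

    `fines` is an iterable of (controller, price) pairs; a price of None
    (a missing value) is skipped. Returns (controller, total, count) tuples.
    """
    totals = defaultdict(lambda: {"total": 0, "count": 0})
    for controller, price in fines:
        if price is None:
            continue
        totals[controller]["total"] += price
        totals[controller]["count"] += 1

    ranked = sorted(totals.items(), key=lambda item: item[1]["total"], reverse=True)
    return [(name, stats["total"], stats["count"]) for name, stats in ranked[:n]]


# Hypothetical rows, not taken from the dataset.
rows = [("A", 100), ("B", 50), ("A", 25), ("C", None), ("C", 10)]
ranking = top_controllers(rows, n=2)
```

Keeping the count next to the total matters for the discussion here, because a controller can rank low on cumulative price while still being fined far more often than any other.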

Google was fined twice, for a total of 57 million euros. It is worth noting that Vodafone España was fined 19 times, the highest number of financial penalties received by any controller. Table 3.2 takes a closer look at the British telecommunications company. Not only did Vodafone España receive a very high number of fines, but its branch, Vodafone ONO, also received 2 fines. Together, Vodafone violated the GDPR 21 times in Spain, about a fifth of the country’s total number of violations.

Overall, the last quarter of 2019 and the first quarter of 2020 saw the highest number of GDPR violations. Since the introduction of the GDPR, most countries have had fewer than 30 violations, with Spain the exception. Spain recorded about 100 GDPR financial penalties, 21 of which were issued to Vodafone. It seems that Vodafone was targeted by the Spanish authority.

3.2 Analysis of violated article

In this part, the researchers want to find out: Which articles and regulations were more likely to be violated? How much were violators commonly fined?

Figure 3.6 answers the first question. Article 5 - “Principles relating to processing of personal data”, Article 6 - “Lawfulness of processing” and Article 32 - “Security of processing” are the most likely to be violated by controllers. In fact, more than half of all violations relate to these three articles. This indicates that the processing of personal data in the EU is problematic, which leads to a high number of violations. Further research could explore these issues and provide a more detailed and clearer view.
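Because a single fine can cite several articles, counting violations per article requires splitting the `article_violated` field first. The sketch below assumes the field is a string such as "Art. 5 GDPR|Art. 6 GDPR"; that format, the regular expression and the function name `count_articles` are assumptions for illustration, and Python is used instead of the R of the original analysis.

```python
import re
from collections import Counter

# Assumption: every cited article appears as "Art. <number>" in the field.
ARTICLE_PATTERN = re.compile(r"Art\.?\s*(\d+)")


def count_articles(article_fields):
    """Count how many fines cite each article number.

    An article cited more than once in the same fine (for example two
    different paragraphs of Article 5) is counted once for that fine.
    """
    counts = Counter()
    for field in article_fields:
        articles = {int(number) for number in ARTICLE_PATTERN.findall(field or "")}
        counts.update(articles)
    return counts


# Hypothetical field values, not rows from the dataset.
fields = [
    "Art. 5 GDPR|Art. 6 GDPR",
    "Art. 32 GDPR",
    "Art. 5 (1) c) GDPR|Art. 5 (2) GDPR",
    None,
]
article_counts = count_articles(fields)
```

Counting each article once per fine is a design choice; counting every paragraph separately would give more weight to fines with detailed citations.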

Figure 3.6: Number of violations by GDPR articles

Figure 3.7 shows that violations of Articles 5, 6 and 32 lead to relatively large fines, which with high probability fall between 10,000 and 100,000 euros.

Figure 3.7: GDPR fines by article

3.3 Text mining GDPR articles

In this part, the researchers want to find out: What were the key words in each article, especially the articles referenced most often in GDPR fines? How did they reflect the privacy issues in the EU?

3.3.1 Overview of words in GDPR articles

Figure 3.8 reflects the key concerns of the GDPR. In the word cloud, we can see common legal words such as “article”, “regulation”, “paragraph” and “pursuant”. We can also find words related to data governance, like “processing”, “processor”, “authority”, “supervisory” and “commission”. It seems that the GDPR provides a framework to govern general data usage within the EU and empowers some authorities to supervise companies. More importantly, there are words that express the spirit of the GDPR: “freedom”, “personal”, “rights” and “public”. They tell us that the GDPR is concerned with personal and public freedom. We were also surprised that “privacy” is not a key word in the GDPR.

Figure 3.8: Word cloud of GDPR

3.3.2 Words in Article 5, 6 and 32 respectively

In Table 3.3, some words occur repeatedly: “data”, “process”, “personal” and “controller”. This suggests that the most common privacy issue in the EU is the illegal processing of personal data. The most distinctive words are “access” in Article 32, “subject” in Article 6, and “purposes” and “manner” in Article 5. They reflect slightly different core topics in these 3 articles, although the articles are very similar overall.

Table 3.3: Top 5 most common words for article 5, 6 and 32 respectively
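| article | word | count |
|---|---|---|
| 5 | purposes | 12 |
| 5 | data | 10 |
| 5 | processed | 7 |
| 5 | personal | 6 |
| 5 | manner | 3 |
| 6 | processing | 25 |
| 6 | data | 21 |
| 6 | personal | 10 |
| 6 | subject | 9 |
| 6 | controller | 8 |
| 32 | controller | 4 |
| 32 | data | 4 |
| 32 | personal | 4 |
| 32 | processing | 4 |
| 32 | access | 3 |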
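The word counts in Table 3.3 follow the usual text mining recipe: split the text into words, drop stop words, and count what remains. The sketch below illustrates that recipe in Python; the original analysis used R, and the short stop-word list, the function name `top_words` and the sample sentence are assumptions, so the output will not reproduce the table.

```python
import re
from collections import Counter

# A deliberately short stop-word list (assumption); a real analysis
# would use a standard lexicon.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "in", "is",
    "it", "not", "of", "on", "or", "shall", "such", "that", "the",
    "this", "to", "which", "with",
}


def top_words(text, n=5):
    """Return the n most common non-stop words in text as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(n)


# Illustrative sentence, not a quotation from the GDPR.
sample = "Personal data shall be processed lawfully. Personal data shall be kept secure."
common = top_words(sample, n=2)
```

No stemming is applied in this sketch, which matches the table, where “processed” and “processing” appear as separate words.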

3.3.3 Overview of trigrams in GDPR articles

Trigram analysis can often tell us more about a text document than single words. Figure 3.9 highlights some terms that are specific to the GDPR. We can see “data protection officer”, which refers to the officer that a controller or processor designates under the GDPR, and “competent supervisory authority”, which probably refers to the data protection authority.
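A trigram is simply a run of three consecutive words, so the counts behind a trigram word cloud can be sketched as follows. This is an illustration in Python, not the R code used for the figure; the function name `trigram_counts`, the optional stop-word filter and the sample text are assumptions.

```python
import re
from collections import Counter


def trigram_counts(text, stop_words=frozenset()):
    """Count three-word sequences in text.

    Trigrams that contain any word in stop_words are dropped, which is
    one common way to keep only content-bearing phrases.
    """
    words = re.findall(r"[a-z]+", text.lower())
    trigrams = zip(words, words[1:], words[2:])
    return Counter(
        " ".join(trigram)
        for trigram in trigrams
        if not any(word in stop_words for word in trigram)
    )


# Illustrative text, not a quotation from the GDPR.
sample = "the data protection officer informs the data protection officer"
all_trigrams = trigram_counts(sample)
filtered = trigram_counts(sample, stop_words={"the"})
```

Without the filter, phrases such as "informs the data" are counted as well; with it, only trigrams made entirely of content words remain, which is closer to the kind of terms discussed here.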

Notice the term “binding corporate rules”, which means that multinational corporations are allowed to transfer personal data within their own organization. These rules increase the risk of personal data leaking to nations outside the EU.

Setting aside these routine legal phrases, we can still find some interesting and familiar terms, such as “historical research purpose”. It suggests that data analysis is taken into consideration by the GDPR.

Figure 3.9: Word cloud of trigram in GDPR

4 Conclusion

The GDPR is one of the most famous regulations on personal data protection. We find some suspicious clustering of GDPR fines issued in the last quarter of 2019 and the first quarter of 2020. However, we find no evidence that the GDPR was generally used as an alternative to a “digital service tax” against multinational corporations. One particular corporation, Vodafone, was fined by the Spanish Data Protection Authority multiple times, which might be a special case. From the violated articles and the text of the GDPR, we find that the most serious privacy issue in the EU is personal data being processed unlawfully.

5 Acknowledgement

The following R packages were used in producing this research:

  • rmarkdown: Xie, Allaire, & Grolemund (2018)
  • tidyverse: Wickham et al. (2019)
  • kableExtra: Zhu (2019)
  • ggplot2: Wickham (2016)
  • plotly: Sievert (2020)
  • bookdown: Xie (2016)
  • wordcloud2: Lang & Chien (2018)
  • lubridate: Grolemund & Wickham (2011)
  • tidytext: Silge & Robinson (2016)
  • rnaturalearth: South (2017)

References

CoreView. (2020). Major gdpr fine tracker. https://www.coreview.com/blog/alpin-gdpr-fines-list/.

Ellis Hughes. (2020). Thebioengineer. https://github.com/thebioengineer.

Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Retrieved from http://www.jstatsoft.org/v40/i03/

Lang, D., & Chien, G.-t. (2018). Wordcloud2: Create word cloud by ’htmlwidget’. Retrieved from https://CRAN.R-project.org/package=wordcloud2

Lewis, J. A., Conleya, H. A., & Lee-Makiyama, H. (2020). Has europe lost both the battle and war over its digital future? https://www.csis.org/analysis/has-europe-lost-both-battle-and-war-over-its-digital-future.

Privacy Affairs. (2020). GDPR fines tracker & statistics. https://www.privacyaffairs.com/gdpr-fines/.

Regulation, G. D. P. (2016). Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46. Official Journal of the European Union (OJ), 59(1-88), 294.

Sievert, C. (2020). Interactive web-based data visualization with r, plotly, and shiny. Chapman; Hall/CRC. Retrieved from https://plotly-r.com

Silge, J., & Robinson, D. (2016). Tidytext: Text mining and analysis using tidy data principles in r. JOSS, 1(3). https://doi.org/10.21105/joss.00037

South, A. (2017). Rnaturalearth: World map data from natural earth. Retrieved from https://CRAN.R-project.org/package=rnaturalearth

Tidy Tuesday. (2020). A weekly social data project in r. https://github.com/rfordatascience/tidytuesday.

Tikkinen-Piri, C., Rohunen, A., & Markkula, J. (2018). EU general data protection regulation: Changes and implications for personal data collecting companies. Computer Law & Security Review, 34(1), 134–153.

Wickham, H. (2016). Ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Xie, Y. (2016). Bookdown: Authoring books and technical documents with R markdown. Boca Raton, Florida: Chapman; Hall/CRC. Retrieved from https://github.com/rstudio/bookdown

Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. Boca Raton, Florida: Chapman; Hall/CRC. Retrieved from https://bookdown.org/yihui/rmarkdown

Zhu, H. (2019). KableExtra: Construct complex table with ’kable’ and pipe syntax. Retrieved from https://CRAN.R-project.org/package=kableExtra